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Abstract 



In this article we propose an improvement on the sequential updating and greedy search (SUGS) algorithm 
|20| for fast fitting of Dirichlet process mixture models. The SUGS algorithm provides a means for very fast 
approximate Bayesian inference for mixture data which is particularly of use when data sets are so large that many 
standard Markov chain Monte Carlo (MCMC) algorithms cannot be applied efficiently, or take a prohibitively 
long time to converge. In particular, these ideas are used to initially interrogate the data, and to refine models 
such that one can potentially apply exact data analysis later on. SUGS relies upon sequentially allocating data 
to clusters and proceeding with an update of the posterior on the subsequent allocations and parameters which 
assumes this allocation is correct. Our modification softens this approach, by providing a probability distribution 
over allocations, with a similar computational cost; this approach has an interpretation as a variational Bayes 
procedure and hence we term it variational SUGS (VSUGS). It is shown in simulated examples that VSUGS can 
out-perform, in terms of density estimation and classification, the original SUGS algorithm in many scenarios. 
In addition, we present a data analysis for flow cytometry data, and SNP data via a three-class dirichlet process 
mixture model illustrating the apparent improvement over SUGS. 

Key-words: Approximate Bayesian Inference; Mixture Modelling; Variational Bayes; Density Estimation. 

1 Introduction 

The demands of fitting models to large data-sets have exploded over the last decade. Increasingly complex data 
sets are available, which has placed demands on statisticians to develop realistic models to represent these data. 
Inevitably, for many classes of models, this places a further emphasis on being able to fit such models accurately 
and in a reasonable time-frame. 

In this article, we consider fast Bayesian statistical inference for Dirichlet process mixture (DPM) models pi,[T5]. 
This particular class of models have proven to be popular in the literature as a tool for both clustering and density 
estimation and there are a wide variety of elegant MCMC and sequential Monte Carlo algorithms; see e.g. [131 ITU] , 
Such algorithms provide exact inference from DPM models, but can be very computationally demanding when 
trying to analyze extremely large data-sets and even more, exact statistical inference from mixtures is notoriously 
difficult; see [llj. As mentioned above, this issue often leads to researchers resorting to approximate inference to 





browse or interrogate the data, so as to refine model specifications for an exact analysis; one particular important 
and interesting method in this direction is the SUGS algorithm. 

The SUGS algorithm is a procedure for fast approximate fitting of DPM models. It relies on an approximation 
of the exact posterior distribution on the allocation of data to components and the component specific parameters, 
by sequentially adding data-points to the model. These data are allocated to a given mixture component and this 
allocation is frozen and taken as the truth when updating the posterior for new data points. The method can be 
sensitive to the ordering of the data, but [2DJ provide procedures to select this ordering in a systematic way. In this 
paper, we reinterpret the SUGS algorithm within a variational Bayes framework. This allows one to derive different 
approximations of the posterior distribution. 

In particular, our interpretation does not mean that one needs to allocate data to a cluster and, instead, provides 
a probability distribution on these allocations; this is done at a minor increase in computational cost. The advantage 
of this generalization, which we call VSUGS, is apparent when one fits the approximation to data whose components 
are close in some sense. In this scenario, we have consistently found (and as illustrated in Section [6| that VSUGS 
outperforms SUGS in a variety senses; this is important as, when the mixture components are highly separated, such 
initial browsing or interrogation of the data is less important. Moreover, our variational approximation provides 
a lower-bound on the log-marginal likelihood; we empirically find that this can be used as a technique for model 
selection (e.g. selecting the order in which the data arrive) which did not seemingly work well in |20) . 

This article is structured as follows. We begin with a motivating example in Section [2] In Section [3] we give 
a basic summary of DPM models. In Section |4] we discuss SUGS and in Section [5] our generalization VSUGS. In 
Section [6] we give numerical examples; both a simulation study and real data analyses associated to flow cytometry 
data and SNP data via a three-class dirichlet process mixture model. In Section[7]we conclude the article, discussing 
avenues for future work. 

2 Motivating Application 

A motivating application for us is the problem of genotyping single nucleotide polymorphisms (SNPs) from SNP 
genotyping microarray data ^9]. Figure |6] illustrates an example dataset. The statistical problem is to characterise 
the three genotype classes AA, AB and BB and classify each data point into one of these three classes. This 
is a straightforward three-way classification problem that can be approached using hierarchical mixture modelling 
where, for example, the class-conditional densities are modelled using multivariate Normal or Student t-distributions. 
Although, these models work well in practice, it is clear that the class-conditional densities are not Normal (or 
Student). We can obtain increased accuracy through the use of semi-parametric models for the class-conditional 
densities using Dirichlet Process Mixtures. However, the size of the data sets presents a massive challenge for 
this type of modelling approach. For a single experiment (individual), modern genotyping microarray produces 
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300,000-5,000,000 two-dimensional measurements. Each study may consist of hundreds to thousands of individuals. 
The data sizes here prohibit the use of Monte Carlo inference and motivate approximate approaches that are able 
to scale to the size of problems encountered. 



3 Dirichlet process mixture model 

Consider a Dirichlet process mixture model of the form, for i € N: 

Y,\e,''-^^- p{-\e,) e,^p p^DP{aPo) (i) 

where Y, e Y C R'^y , 0, e Q C W'^" are observation specific parameters, P{-\9i) is a conditional probability which 
admits a density p{-\9i) w.r.t. a single dominating cr— finite measure for each 9i (which is often Lebesgue), P is 
an unknown mixing distribution, and P ^ DP{aPf)) indicates that the prior for P is a Dirichlet process |B] with 
precision parameter a £ M+ and base measure Pq. Pq is known and we also consider a to be fixed. We remark 
that, in connection to subsequent methodology to be presented, |20] consider a way of handling unknown a which 
can also be used in all the extensions we consider but for simplicity we do not consider this below. 

A well known property of the Dirichlet process is that a distribution drawn from it will put all its mass on a 
countable set of points. Following the notation of [20] we will write 9 — {9j}°^^ for the set of distinct values in the 
sequence 9 = {6j}^i where points in 9 are labelled according to their order of appearance in 9. Next, let 6i — j if 
9i — 9j and write S — {Sj}J^i. Using the Polya urn characterization of the Dirichlet process [3] we can rewrite (jlj 
in the form 

Y,\S,,9s,'-'^^- P{-\9sJ p{5,9)=p{5)p{9) 

where p{9) — Y\^=iPQ{9j), Po(') is the probability density associated to Pq and p{5) is as follows, (writing Si-i = 
{Si, Si), with the convention Si;o is the null vector), for i e {2, 3, . . . } 

fx ■\x ^ J a+i-i i <= {!' • • • ' ^^i} , , 

P[Ot = j\Sl:t-l) = < (2) 

where p{5i = 1) = 1, n^*"* = Card({(5,„ : (5m = J, 1 < 'ti < * ~ 1}), and rii ~ 7ii((5i:i-i) is the maximum value in 
(i.e. the number of components "seen" in the data up to to time i — 1). This representation of the model 
where the unknown measure P is integrated out is important for many Monte Carlo sampling schemes for fitting 
DP mixtures El [H] . 

Later we will work with a truncated Dirichlet process mixture model, which is often convenient for computations 
[10] . Such truncations are based on the stick breaking representation of the Dirichlet process [IH] ■ Suppose we limit 
the number of distinct values appearing in the sequence {P^Y^^x to an upper truncation limit T > 1. Then 
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generalizing the Polya urn representation we can consider the truncated Dirichlct process with p{S, 6) = p{d)p{9) 
where nowp(6') = nJ=i-Po(^j) andp((5) is defined recursively hy p{Si = 1) = 1 and similarly to (jij) (see, for example, 
[ini Section 2.2.2]) for i e {2, 3, . . . } 



We note that truncations have also been used in the context of variational approximations for Dirichlet process 
mixtures U although these authors consider the truncation point as a variational parameter without truncating 
the original model. The algorithm of [4 , although related to ours, is not a sequential algorithm however - here 
we are interested in very fast sequential algorithms related to the SUGS method of [20]. [20] compare their 
approach with a variety of other fast computational methodologies for Dirichlet process mixtures, and show that 
their algorithm is competitive with other fast approximation methodologies. In this work we focus only on comparing 
our new approach with the original SUGS algorithm, and refer the reader to [20 for information about the relative 
performance of alternative approximations to SUGS. 





4 The SUGS algorithm 



SUGS is a recursive algorithm that takes at time i — 1 an estimate Si;i-i oiSi-^^i and an approximation of p{9\yi-i^i) 



and produces an estimate Si-a of Si;i and an approximation of p{9\yi-i). To start the recursion we use 6i = 1 and 



p{9\yi) — p{9\ Ij/i, 8\) Iljoi Po(^j)- Here and in what follows a term &i in the conditioning means (5.; = (5^ and similarly 
for Suppose at time i — 1 we have an estimate b\;i-\ of b\;i-\. Consider the posterior distribution p((5i, 9\y\;i) 



and approximate this by p{8i,Q\y\;i^b\.i-\) . That is to say, we initially consider our approximation of p{6i^9\y\;i), 



Vi{.k,^\yv.i), as 



Vi{bi,(^\y\:i) oc p{6i,9\yx:i-\-,h:i-\)v{yi\Q&,) 




Using this approximation and integrating out 9 




The SUGS algorithm sets: 



k : 



= argmax^^g{^_ p{5^\yv.^) 



(4) 



4 



and then one replaces 



by 



to form the approximation: 



PiiSi,0\yi:t) =I{5j((5,;) I p(6'j|5i:,_i,2/i,,_i) > p{9s.\Si.,^,yi..i). (5) 

In this approximation the components 9j are independent for different j, the posterior for 9j for j ^ Si is unchanged 
from time i — 1, and the posterior for 9g is updated by assuming that Si = Si so that yi represents an observation 
from this mixture component. If the mixture components are from the exponential family and conjugate priors 
are used, the densities such as yi:j) can be calculated in closed form and sufficient statistics are updated 

recursively, leading to a very efficient update. 

|20j consider a number of further innovations in their algorithm. First, since the fitting algorithm is sequential 
and there is a dependence of the fit on the ordering of the data, they suggest running their algorithm for different 
random orderings and then choosing the best according to a pseudo-likelihood criterion. Secondly, for model 
comparison they suggest the approximation 

and show that this crude approximation can be useful for tasks such as comparison of parametric and nonparametric 
models. Thirdly, they suggest a way of dealing with unknown a in the Dirichlet process prior. 



5 An improvement of the SUGS algorithm 

Here we suggest a simple improvement of the SUGS algorithm which we call VSUGS (variational SUGS). We begin 
with a brief introduction to variational Bayes (VB) methods. 

5.1 Variational Bayes 

Suppose we have a parameter ^ G S C M.'^e and data y, p(^) is the prior density, p{y\£,) the likelihood and p{£,\y) 
denotes the posterior density w.r.t. Lebesgue measure (which we use for presentational purposes only). In VB [T^[^ 
we split ^ into blocks ^ = ^i:^ — (^i, ^k), £,j G Sj, with Si x • • • x = S and seek to find a good approximation 
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to p{£.i:k\y) of the form 

k 

where each q{Cj) is a probabiUty density w.r.t. the appropriate dimensional Lebesgue measure. Given known 
probabihty densities for q{£,j), j ^ i, the optimal choice for q{£,i) for minimizing the KuUback-Leibler divergence 

KL{q\\p) J log (6) 

is 

« exp{E_,-(j,)[logp(S)p(2/|S)]} (7) 

where E_q(j.)[-] denotes expectation w.r.t. Y[j=^i'li^j)- Hence there is a gradient descent algorithm for minimizing 
([7]) based on choosing initial values for the factors in q(r]) and then iteratively updating each term according to ([t]). 
Minimizing Q is equivalent to maximizing 

m /log( ^(^^|g^f^-^) )?(er.)^6. (8) 

and ([s]) is a lower bound on the log marginal likelihood \ogp(y) where p(y) — J p{£,i:k)piy\£,i:k)dS,i:k- p{y) is a 
key quantity in Baycsian model selection and the lower bound is tight, L{q) ~ logp(y), when q{£,i:k) — p(Ci:fc|y)- 
Generally L{q) is often used as an approximation to logp(y) for model selection in the VB framework. 

5.2 The VSUGS algorithm 

As we have seen in ^ the SUGS algorithm recursively approximates p{5i;i^ d\yi-i). Considering this in a variational 
framework, suppose we have an approximation to the posterior of the form 

{n*-i('5,)||n*-i(^^)|- 

In the above expression, we omit certain conditionings (as will become apparent below) to reduce the subsequent 
notational burdens. We will suggest a way to update this approximation of p((5i:i, using variational ideas. 

The approximation will be of the form {11^=1 '?i('^j)}{njeN start the recursion with qi{5i = 1) = 1, 
=p(0i|yi,<5i - 1), =po(e,), J e {2,3, . . .}. 

The idea is to make a particular fixed choice at time i for qi{5i) and not to revisit that choice at future times. 
That is, the solution to the (partial) variational optimization at time i — 1 is used to initialize the optimization at 
time i. The original SUGS algorithm chooses qi{5i) — I|^.j((5i) where Si is defined in Q but it is possible to make a 
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better choice than this without sacrificing the attractive computational properties of the original SUGS algorithm. 
In particular, at time i, we set qi{Sj) — qi-i{5j), j G {1, — 1} and choose 



qiiSi^j) = Qij J qi-ii0s,)p{yi\9si)d0si (9) 
for j e {1, . . . , z A T} where T is a truncation point for the number of mixture components and 



q^3 



qi{Si = j) will be chosen as an approximation to p{Si\yi-i). To provide some intuition for this selection of qi{Si ~ j), 
we remark that 

p{6i\yi:i) oc p{Si\yi.i-.i)p{y.i\Si,yi.,i^i) 

= p{S,\yi..,-i) J p{y,\esMOsAyv.-i)d05^ (10) 

Next, note that 

P{Si\yi:i-l) = Eb((Ji|yi:i-l,(5l:i-l)|?/l:i-l] = E[p(<5i 

where the expectation is w.r.t. p((5i:i-i|j/i:i-i). This suggests approximating p{6i\yi-i^i) by taking the expectation 
in the above expression with respect to the variational posterior Although this approximation is still 

not easy to work with, if we we condition on = (i — 1) A T in ([s]) and then take the expectation with respect to 
gi_i((5i:i_i), we get qij. So qtj is an approximation to p{Si\yi;i-i) in |lO| ) and the term 

qi- 1 {05i )p{yi 1 5i , 9s^ )d9si 



in ^ simply approximates the integral in (lOl by replacing p(9si\yi:i~i) with the variational posterior qi-i{Os.). 
As noted earlier, the original SUGS algorithm can be placed in our framework by using qi{Si) = I^^ a "hard" 

rather than "soft" allocation to clusters which tends to result in greater under-estimation of uncertainty than in 
our approach. 

Next, using qi-i{6j) as the prior for 9j at time i for processing the data point y^, the optimal choice for qi{9j) 
is, via ([7| 

%{ej) cx $i_i(6lj)cxp ^E_j^(e^.) ^^I{fe}((5i)log(p(2/^|efc))j j 

« q^-l{0My^\(^Jf'^''='^ (11) 
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where E_^.(^g.-^ denotes expectation w.r.t. {YTh^iqi{Sh)}{Ylh^jqi{dh)} and Ti = i AT. Again, with the choice 
qi{Si) — exponential family mixture components and conjugate priors this reduces to the SUGS update. 

If the mixture components are normal then the generalized update above can be done in closed form - we will 
give details of this below. Note the attractive form of the above update. The likelihood contribution from the i*^ 
observation is split among different mixture components j according to the weight qi{5i — j) rather than assuming 
the most likely allocation as in the original SUGS algorithm. We can also use the variational lower bound ([8| to 
approximate the marginal likelihood more accurately than with \og{p{yi-i\di-i)) in the SUGS algorithm. Details of 
this are given in the next section for the case of normal mixture components. 

5.3 VSUGS for DP mixtures of normals 

|20| consider the case of DP mixtures of normals in detail. In this case, 9i — where /i^ is the mean for the 

ith component and Q is the precision and we have Yi\Si = j^Os^ M{^j,C,J^). They consider a normal inverse- 
gamma prior for Oj^ Po{9j) = po{fJ'j\'m,i'(^J'^)pQ{Q\a,b) where po{iij\m, is a normal density with associated 
distribution with mean p and variance i^C]'^ and p and are known hyperparameters, and pQ{(^j\a,b) is a gamma 
density with known parameters a and b. 

Our VSUGS algorithm results in qi{Oj) being normal inverse-gamma also, qi{0j) = qi{p.j\p'j'\iyj^\~^)qi{Q\a^j\b^j^) 
where qi{pj\p^j'\ fj^^j'^) is the normal density with mean p^*"* and variance and 5i(Cjl^j*''i ^j*"*) is a gamma 

density with parameters aj*'' and 6^'^ The parameters i^j^^ a^*^ and b'^j ^ are updated recursively by (c.f. [20l 
p. 204]) 

U) _ qi{S,=j) 

To calculate the terms qi{6i — j) in the VSUGS algorithm, we also need to evaluate the integral 

In the normal case with the priors we have chosen this integral evaluates to a t density, i2a<'"^' '^i* ^^'^j' I 
a'j' + 1)) where td{y; m, s^) denotes a t density for y with d degrees of freedom, location parameter m and 

scale parameter s. 

An approximate variational lower bound on logj3(yi:i) can also be computed recursively. We can think of the 
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posterior at stage i — 1 as the prior to be updated by the hkchhood contribution for the ith observation: 



p{Si,0\yi.,,) cx p{6^, 0\yi,i_i)p{yi\es^). 

Approximating p{Si\yi-i^i) by qij as we did previously and approximating p{6\yi-i^i) by nj=i ^-^^ calcu- 

lating the lower bound Q using these priors for the likelihood contribution p{yi\9si) gives 

= E - ) - log )) + log (r(«r'')) + (log ) - log 



- E ^^('5' - I - \ log (^?) - ^ log (2-) - \ I 

Ti T,- 

■ E ^iC*^' = log = i)) + E " -'^ log 



where '(/'(•) denotes the digamma function. 

6 Numerical examples 
6.1 Density Estimation 

Setup. We generated 100 data sets for each setting and all results are averaged over that for each data set. 

''J- 0.25) + ^AA(0, 0.5) + ^^Af{dfi, 2),t ^ 1, . . . , N, 

5 10 10 

for < dfi < 5 with N — 500. Examples are shown in Figure [2] 

We compared the relative errors of SUGS to VSUGS under different values of a, i.e. a £ {0.1, 0.5, 1, 5, 10 . . . 45, 50}. 
Throughout T — 200 for VSUGS. We used 50 different (but random) orderings of the data and chose the ordering 



with the maximal variational lower-bound for VSUGS in Section 5.3 and the best ordering for SUGS as in pU] . 
We also used a standard Collapsed Gibbs Sampling method [T7] for posterior inference on some of the datasets for 
comparison. To assess the performance in density estimation we compute the value of 

N 

e = E(/(y^) (12) 

where f(yj), f{yj), are the estimated (predictive) and true density for the data, evaluated at data-point yj and 
examine the relative errors. 
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Results. Figure |4] shows example predictive density estimates from SUGS and VSUGS. Figure [sj^a) shows that 
for large values of a and closely spaced clusters d/i < 1, VSUGS provides more accurate density estimates than 
SUGS. However, for d^> \ and a < 20, i.e. well-separated clusters, the density estimates from SUGS are relatively 
more accurate. 

The computation time for VSUGS is constant for given truncation level T as we use a fixed maximum number 
of mixture components. In contrast, the computation time required for SUGS is variable and depends both on the 
data set and the order in which the data is processed. Figure |3jb) considers the computation burden for the two 
methods. In particular, for large values of a and more mixture components, SUGS can be computationally quite 
demanding due to the excessive numbers of mixture components that are realised. Whilst in practice, one might 
estimate a, this value is not known and hence SUGS could both be significantly less accurate and computationally 
more expensive in many situations. 

We compared the SUGS and VSUGS predictive densities with those obtained from Collapsed Gibbs Sampling, 
we considered the case d/i = 0.2, a = 0.1 and show results in Table [l] for different data sizes N and the truncation 
parameter T. Using Collapsed Gibbs Sampling as a "gold standard", we find that VSUGS consistently provides 
better predictive density estimates. Example computational times for N — 500 were 4 seconds for SUGS, 12 seconds 
for VSUGS (T = 150) and 193 seconds for Collapsed Gibbs Sampling. 



6.2 Density estimation for flow cytometry data 

We analyzed the flow cytometry data example, which has been studied thoroughly by |16j . Flow cytometers detect 
fluorescent reporter markers that typically correspond to specific cell surface or intracellular proteins on individual 
cells, and can assay millions of such cells in a fluid stream in minutes. These data points are associated with one 
(or more) components of a Gaussian mixture model (j6|) and are from human peripheral blood cells, with 6 marker 
measurements each: Forward Scatter (measure of cell size), Side Scatter (Measure of cell granularity), CD4 (marker 
for helper T cells), IFNg+IL-2 (effector crytokines), CDS (a marker for cytotoxic T cells), CDS (marker for all T 
cells); that is, the observations are 6 dimensional (the priors are modified to Normal-inverse Wishart, which leads 



to a similar derivation of the VSUGS algorithm as in Section 5.3 in this multivariate scenario). Our objective is to 
compare the performance of VSUGS to SUGS and Collapsed Gibbs Sampling for clustering and density estimation 
in this multivariate, large data setting. 

Data. The size of the whole data is 50, 000 with 6 dimensions and [T^ state the components of these data 
are centered closely. In the following simulations, we adopted a Gamma prior for a, i.e. a ^ CJ(1, 1) for the three 
approaches. When considering a as unknown we use the approach to handling uncertainty in a described in |20| for 
all algorithms. The Collapsed Gibbs sampler was run for a 300 iteration burn-in followed by 1000 iterations. This 
low number is adopted due to the size and complexity of the data; these type of data scenarios are exactly those 
which motivate the development of SUGS and VSUGS algorithms. For the VSUGS approximation, the truncation 
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value T is set to be 40 (we did not find significant differences in our results when T is increased or decreased by 
around 10). We chose the permutation of the order of the data for VSUGS and SUGS as in the previous example. 

Results. We first compared the computation time for the three method with N ~ 1,000 data points randomly 
choose from the whole data set. This process is repeated for 100 times and we took the average value of the 
time cost. The analyses through Collapsed Gibbs sampling were completed in approximately 509 seconds while 
approximately 8 seconds and 14 seconds were required for SUGS and VSUGS respectively. 

Next, we choose another data sample of 49, 000 data points. We were interested in the performance of all 
approaches in clustering and density estimation (i.e. the predictive density). The predictive density is calculated 
on the remaining 1, 000 data points; the Collapsed Gibbs Sampler analysis was repeated 30 times. The performance 
of predictive density estimates obtained by the three approaches are shown in Table [2j The Collapsed Gibbs 
sampling method has the greatest predictive ability with VSUGS showing greater predictive power than SUGS. 
This illustrates that the VSUGS approximation is performing better than SUGS with regard to density estimation 
and provides an efficient way of detecting and drawing inferences about rare populations in the presence of very 
large datasets. Figure [S] shows that SUGS has difficulty approximating the data density whilst our VSUGS approach 
better approximates the density estimates by Gibbs Sampling. 

6.3 SNP Genotyping 

We now turn to our original motivating SNP genotyping example and examined the use of VSUGS and SUGS for 
a hierarchical Bayesian clustering problem. 

Data. For our experiments, we considered a genotyping dataset that were considered in a recent comparison 
study [9]. The study consists of 6 different individuals, each individual was genotyped three times using the Illumina 
HumanHap650 genotyping array which produces approximately 650,000 two-dimensional measurements per sample. 
We normalised the data by taking log2 transforms and performed quantile normalisation between the two channels 
to correct for allele-specific biases. 

Model. We clustered the data using a three-class Bayesian mixture model: 

Y,\X, ^ P(-|XO, 
X,\w ~ 7W(i(;i:3), 

where A^(u'i:3) is the multinomial distribution, we fixed wi = W2 = = 1/3 and the class conditional density 
P{-\X) is given by a Dirichlet Process Mixture of Bivariate Normal Distributions (one DPM for each genotype). 
We implemented the model using both the SUGS and VSUGS approaches to fit the DPMs. 

For comparison, we classified the genotyping data using a standard genotyping tool, GenoSNP |9J which models 
the class-conditional densities using multivariate Student-t distribution and also performs inference using variational 
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methods. We used majority vote over the three repHcates per sample to obtain the true genotypes from the GenoSNP 
genotype calls. 

Results. Over the 6 x 3 = 18 samples, the average concordance of our VSUGS implementation was 99.45% 
compared to 98.90% for the SUGS implementation. Figure [6] illustrates genotyping performance for one particular 
sample. Figure [6f^c) indicates that, using genotype calls from GenoSNP as a reference, VSUGS produced the 
highest concordance with the GenoSNP results across a range of GenoSNP call probability thresholds. For the 
SNPs with discordant genotype calls between GenoSNP and SUGS/VSUGS, we plotted the empirical distribution 
of the maximum genotype call probabilities for these SNPS. Figure |6|^d) shows that for VSUGS the discordant 
genotype calls were associated with SNPs where the maximum genotype classification probability was around 0.5. 
With SUGS, discordant calls have probabilities in excess of 0.5. 

7 Summary 

In this paper we have considered VSUGS as a generalization of the SUGS algorithm for fast inference from DPM 
models. We saw that when the components of the mixture appear to be close in some sense, VSUGS seems to 
consistently outperform SUGS with regards to density estimation and this improvement is also found by using 
our variational lower-bound for model selection. In addition, when a grows, we have found VSUGS performs 
significantly better, with less computation time. We have found that for real data examples, VSUGS can detect 
features of the data which SUGS cannot. 

In terms of extensions to our work, we are currently considering the development of VSUGS for new models. In 
particular, we are developing the ideas for hierarchical mixture models and infinite hidden Markov models. These 
initial experiments suggests that VSUGS can prove to be a very efficient tool for fast, but approximate, inference 
from a wide class of statistical models. 
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Table 1: Relative error of density estimates of SUGS and VSUGS to Collapsed Gibbs Sampling. 
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Table 2: Log predictive probability on 1, 000 test data points (49,000 training samples) obtained through Collapsed 
Gibbs sampHng, SUGS and VSUGS. 
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Figure 1: (a) An example SNP genotyping dataset showing three genotype classes AA, AB and BB and (b) a 
transformation using Principal Component Analysis of the same data in (a) . 
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Figure 2: Example probability densities for four simulated mixture datasets with d/i = 0.2,0.5, 1.0 and 2.0. 



(a) (b) 
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Figure 3: (a) (Log) relative error of SUGS to VSUGS as a function of {d^,a). (b) Computational times for Gibbs 
Sampling, SUGS and VSUGS. Results are averaged over 100 data sets. 
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Figure 5: Flow Cytometry density estimation examples. Scatter plots of the two-dimensional slices of the mul- 
tivariate dataset (grey) and contour plots (black) showing density estimates. The contour plot is the estimated 
distribution through (columns from left to right) collapsed Gibbs sampling, SUGS and VSUGS. 
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Figure 6: Locations of discordant calls between GenoSNP |9j and (a) SUGS, (b) VSUGS. (c) Genotype call concor- 
dance between GenoSNP and the (d) distribution of genotype call probabilities for discordant calls (o) SUGS and 
(□) VSUGS. 
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